please run "all.sh" before executing this notebook!

Business Understanding

The salary is encoded here as ConvertedComp - it's in USD

The Stack Overflow data contains a lot of columns (61 in total). Most of them hold a string from a set of possible answers; only a few contain floating point values. In total, we have 64461 answers, of which about 53.9% (34756) contain an answer related to the current job salary - less than for the job satisfaction we analysed at the beginning of this year (70%).

We have a total of 2997 unique values. But these values can be reported as 'Monthly', 'Weekly' or 'Yearly' - that needs to be considered when doing the data preparation.
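The frequency handling can be sketched as follows. This is a minimal illustration with made-up numbers, assuming 12 working months and 50 working weeks (the factors stated in the survey schema); the column names CompTotal and CompFreq follow the survey, but the AnnualComp name is my own:

```python
import pandas as pd

# Assumed conversion factors: 12 working months, 50 working weeks
factors = {"Yearly": 1, "Monthly": 12, "Weekly": 50}

# Toy data - each row encodes the same yearly salary at a different frequency
df = pd.DataFrame({
    "CompTotal": [60000.0, 5000.0, 1200.0],
    "CompFreq": ["Yearly", "Monthly", "Weekly"],
})

# Normalize everything to a yearly base before comparing values
df["AnnualComp"] = df["CompTotal"] * df["CompFreq"].map(factors)
```

With this normalization, all three toy rows end up at the same yearly figure.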

Section 3: Data Preparation

Selection

All rows that don't have a salary defined were removed from the dataset. I also removed all data points that are more than 3 standard deviations away from the mean, resulting in the drop of 3241 records.
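The 3-standard-deviation filter can be sketched like this on toy data (the real notebook operates on the full survey frame; only the ConvertedComp column name is taken from the data):

```python
import pandas as pd

# Toy data: 20 ordinary salaries plus one extreme value
salaries = [50_000] * 20 + [5_000_000]
df = pd.DataFrame({"ConvertedComp": salaries})

# Keep only rows within 3 standard deviations of the mean
mean = df["ConvertedComp"].mean()
std = df["ConvertedComp"].std()
df_clean = df[(df["ConvertedComp"] - mean).abs() <= 3 * std]
```

The extreme value falls outside the 3-sigma band and is dropped, while all ordinary rows survive.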

Construct

I created new columns out of the existing ones in order to extract the features we want to analyse. This was done for all columns.

Constructed data (and their analysis)

Numeric Values

When a column has a numeric value, we can use it directly as a feature. No scaling is done, as I assume the NN is able to handle the unscaled data correctly. If a value is NaN, we record it in a new column named <column_name>_NA. This prevents the deletion of the entire row only because some information is missing. The NaN columns are not shown here.
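A minimal sketch of this NaN-indicator construction (the fill value of 0.0 is my assumption; the source does not state what replaces the missing value):

```python
import pandas as pd

# Toy frame with one missing age
df = pd.DataFrame({"Age": [25.0, None, 40.0]})

for col in ["Age"]:
    # <column_name>_NA flags whether the original value was missing
    df[f"{col}_NA"] = df[col].isna().astype(int)
    # Fill the original column so the row survives (0.0 is an assumed placeholder)
    df[col] = df[col].fillna(0.0)
```

The row with the missing age is kept, and the model can still learn from the fact that the value was absent.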

Hours of work by week

This column reflects the number of hours worked per week.

Scaled output of the distribution. We still have an imbalanced distribution, even after removing the outliers. I would also consider 94 hours of work a really large value.

Age

This column reflects the age of the user.

Most participants are about 25 years old. After scaling and removing the outliers, the maximum age is 59. The first bucket is the biggest one, suggesting most people are really young when visiting Stack Overflow.

Age 1st Code

This column reflects the age at which the user started to code.

A more Gaussian distribution here - most people start coding at 15 years of age.

YearsCode

This column reflects the number of years the user has been coding.

Most users have 10 years of coding experience in general.

YearsCodePro

This column reflects the number of years of coding in a professional environment.

Most of the users have quite a low amount of professional coding experience. This can be an indicator that they are more likely to use Stack Overflow than a more experienced developer, who can read the documentation of the particular framework directly.

Categorical Features

This gives an insight into the categorical data we found. Some answers are exclusive; for example, only one company size could be selected. For other answers, entire sets could be selected, such as the programming languages used. Both kinds were one-hot encoded.
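Both encodings can be sketched with pandas. The column names and values below are made up for illustration; the survey stores multi-select answers as semicolon-separated strings:

```python
import pandas as pd

# Toy data: one exclusive column, one multi-select column
df = pd.DataFrame({
    "OrgSize": ["2-9", "10-19", "2-9"],
    "LanguageWorkedWith": ["Python;C++", "Python", "JavaScript;C++"],
})

# Exclusive answers: one-hot encode the single selected value
exclusive = pd.get_dummies(df["OrgSize"], prefix="OrgSize")

# Multi-select answers: split on ';' and one-hot encode every selected value
multi = df["LanguageWorkedWith"].str.get_dummies(sep=";")
```

`get_dummies` yields one indicator column per company size, while `str.get_dummies` yields one indicator column per language, with several columns set to 1 in the same row.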

CompFreq

This column reflects the payment frequency.

19107 respondents gave yearly salary information, 14680 monthly and 969 weekly. The remaining respondents did not answer.

Country

This column reflects the country of the user.

We have a lot of countries with a low number of responses. It's possible that they get removed when we split into train and test sets. About 12k answers come from the USA, with a fast-dropping distribution after that.

Label

As we want to predict the salary, we use the ConvertedComp column of the data. The payment frequency and the different currencies have been converted to one base, allowing a better comparison. From the Stack Overflow schema file:

ConvertedComp,"Salary converted to annual USD salaries using the exchange rate on 2020-02-19, assuming 12 working months and 50 working weeks."

Most have a low annual income, around 10k. But this may still be a lot of money, depending on the geographical location of the user. After 200k the number of items per bucket gets low - only a few users have such a high income. When looking at the removed outlier data above, the maximum value moves to 580k - still a lot of money. In the future I may consider removing even more outliers.

Modelling

To reduce the required memory (and the number of features the user has to provide for a prediction), I trained a simple regression model on each column using TensorFlow 2.0 (with Keras). Then the best 15 models were combined to produce the final model (see shared/train.py in the repository for details).
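A minimal sketch of one such per-column regressor - the actual architecture lives in shared/train.py, so the single linear layer, optimizer and epoch count here are assumptions for illustration only:

```python
import numpy as np
import tensorflow as tf

def build_single_feature_model() -> tf.keras.Model:
    # Assumed architecture: one Dense unit mapping a single feature to the salary
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,)),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Toy single-feature training data
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])

model = build_single_feature_model()
model.fit(x, y, epochs=1, verbose=0)
pred = model.predict(x, verbose=0)
```

One model of this shape per column keeps each regressor tiny, and the per-column validation error then decides which 15 feed into the combined model.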

Evaluation

I used Mean Squared Error as the metric to compare the models. The best single model used the country feature - not a big surprise, as most of the data came from US respondents and the salary should be high there as well.

The top 15 features, ranked by mean squared error on the validation (test) set. This is the same order in which the user has to provide information for a custom prediction.

Deployment

To have something to deploy this time as well, I packed the Keras model into TensorFlow.js and serve it on my blog. Everything is rendered and predicted in the browser, to make sure the user is not scared of putting (sensitive) data into the application.